Abstract

The HMAX model has recently been proposed by Riesenhuber & Poggio [15] as a hierarchical model of
position- and size-invariant object recognition in visual cortex. It has also turned out to model successfully
a number of other properties of the ventral visual stream (the visual pathway thought to be crucial for
object recognition in cortex), and particularly of (view-tuned) neurons in macaque inferotemporal cortex, the
brain area at the top of the ventral stream. The original modeling study [15] only used "paperclip" stimuli,
as in the corresponding physiology experiment [8], and did not explore systematically how model units'
invariance properties depended on model parameters. In this study, we aimed at a deeper understanding
of the inner workings of HMAX and its performance for various parameter settings and "natural"
stimulus classes. We examined HMAX responses for different stimulus sizes and positions systematically and
found a dependence of model units' responses on stimulus position for which a quantitative description
is offered. Scale invariance properties were found to be dependent on the particular stimulus class used.
Moreover, a given view-tuned unit can exhibit substantially different invariance ranges when mapped
with different probe stimuli. This has potentially interesting ramifications for experimental studies in
which the receptive field of a neuron and its scale invariance properties are usually only mapped with
probe objects of a single type.

1  Introduction

Models of neural information processing in which increasingly
complex representations of a stimulus are gradually built up in a
hierarchy have been considered ever since the seminal work of Hubel
and Wiesel on receptive fields of simple and complex cells in cat
striate cortex [2]. The HMAX model has recently been proposed as an
application of this principle to the problem of invariant object
recognition in the ventral visual stream of primates, the pathway
thought to be crucial for object recognition [15]. Neurons in the
inferotemporal cortex (IT), the highest visual area in the ventral
stream, not only respond selectively to complex stimuli; their
response to a preferred stimulus is also largely independent of the
size and position of the stimulus in the visual field [5]. Similar
properties are achieved in HMAX by a combination of two different
computational mechanisms: a weighted linear sum for building more
complex features from simpler ones, akin to a template match
operation, and a highly nonlinear "MAX" operation, where a unit's
output is determined by its most strongly activated input unit. "MAX"
pooling over afferents tuned to the same feature, but at different
sizes or positions, yields robust responses whenever this feature is
present within the input image, regardless of its size or position
(see Figure 1 and Methods). This model has been shown to account well
for a number of crucial properties of information processing in the
ventral visual stream of humans and macaques (see [6, 14, 16-18]),
including view-tuned representation of three-dimensional objects [8],
response to mirror images [9], recognition in clutter [11], and object
categorization [1, 6].

Previous studies using HMAX have employed a fixed set of parameters,
and did not examine in a systematic way how the tuning properties of
model units depend on the model's parameter settings and the specific
stimuli used. In this study, we examined in detail the effects of
variations of stimulus size and position on the responses of model
units and their impact on object recognition performance, using two
different stimulus classes: paperclips (as used in the original
publication [15]) and cars (as used in [17]), to gain a deeper
understanding of how such IT neuron invariance properties could be
influenced by properties of lower areas in the visual stream. This
also provided insight into the suitability of standard HMAX feature
detectors for "natural" object classes.

2  Methods 

2.1  The  HMAX  model 

The HMAX model of object recognition in the ventral visual stream of
primates has been described in detail elsewhere [15]. Briefly, input
images (we used 128 x 128 or 160 x 160 greyscale pixel images) are
densely sampled by arrays of two-dimensional Gaussian filters, the
so-called S1 units (second derivative of Gaussian, orientations 0°,
45°, 90°, and 135°, sizes from 7 x 7 to 29 x 29 pixels in two-pixel
steps) sensitive to bars of different orientations, thus roughly
resembling properties of simple cells in striate cortex. At each pixel
of the input image, filters of each size and orientation are centered.
The filters are sum-normalized to zero and square-normalized to 1, and
the result of the convolution of an image patch with a filter is
divided by the power (sum of squares) of the image patch. This yields
an S1 activity between -1 and 1.

In the next step, filter bands are defined, i.e., groups of S1 filters
of a certain size range (7 x 7 to 9 x 9 pixels; 11 x 11 to 15 x 15
pixels; 17 x 17 to 21 x 21 pixels; and 23 x 23 to 29 x 29 pixels).
Within each filter band, a pooling range is defined (variable
poolRange) which determines the size of the array of neighboring S1
units of all sizes in that filter band which feed into a C1 unit
(roughly corresponding to complex cells of striate cortex). Only S1
filters with the same preferred orientation feed into a given C1 unit
to preserve feature specificity. As in [15], we used pooling range
values from 4 for the smallest filters (meaning that 4 x 4 neighboring
S1 filters of size 7 x 7 pixels and 4 x 4 filters of size 9 x 9 pixels
feed into a single C1 unit of the smallest filter band) over 6 and 9
for the intermediate filter bands, respectively, to 12 for the largest
filter band. The pooling operation that the C1 units use is the "MAX"
operation, i.e., a C1 unit's activity is determined by the strongest
input it receives. That is, a C1 unit responds best to a bar of the
same orientation as the S1 units that feed into it, but already with
an amount of spatial and size invariance that corresponds to the
spatial and filter size pooling ranges used for a C1 unit in the
respective filter band. Additionally, C1 units are invariant to
contrast reversal, much as complex cells in striate cortex, by taking
the absolute value of their S1 inputs (before performing the MAX
operation), modeling input from two sets of simple cell populations
with opposite phase. Possible firing rates of a C1 unit thus range
from 0 to 1. Furthermore, the receptive fields of the C1 units overlap
by a certain amount, given by the value of the parameter c1Overlap.
We mostly used a value of 2 (as in [15]), meaning that half the S1
units feeding into a C1 unit were also used as input for the adjacent
C1 unit in each direction. Higher values of c1Overlap indicate a
greater degree of overlap (for an illustration of these arrangements,
see also Figure 5).
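
The C1 pooling stage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original HMAX implementation: the function name `c1_pool`, the list-of-lists representation of S1 maps, and the decision to skip partial windows at the image border are hypothetical choices made here. The step between neighboring C1 units is taken as the pooling range divided by the overlap, rounded up.

```python
def c1_pool(s1_maps, pool_range, c1_overlap):
    """MAX-pool absolute S1 responses of one orientation within one filter band.

    s1_maps: list of 2-D lists (one map per filter size in the band),
             all of the same shape, with values in [-1, 1].
    Border handling (skipping partial windows) is an assumption of this sketch.
    """
    step = -(-pool_range // c1_overlap)  # integer ceil(pool_range / c1_overlap)
    height, width = len(s1_maps[0]), len(s1_maps[0][0])
    c1 = []
    for top in range(0, height - pool_range + 1, step):
        row = []
        for left in range(0, width - pool_range + 1, step):
            # Absolute value first (contrast-reversal invariance), then MAX
            # over positions and over all filter sizes in the band.
            row.append(max(
                abs(s1[y][x])
                for s1 in s1_maps
                for y in range(top, top + pool_range)
                for x in range(left, left + pool_range)
            ))
        c1.append(row)
    return c1


# Toy example: two 6 x 6 S1 maps (two filter sizes), pooling range 4, overlap 2.
small = [[0.0] * 6 for _ in range(6)]
large = [[0.0] * 6 for _ in range(6)]
small[0][0] = -0.8  # response to a contrast-reversed bar
large[5][5] = 0.5
c1_map = c1_pool([small, large], pool_range=4, c1_overlap=2)
print(c1_map)  # [[0.8, 0.0], [0.0, 0.5]]
```

With a pooling range of 4 and an overlap of 2, neighboring windows start 2 pixels apart, so each window shares half of its S1 inputs with the next one in each direction.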

Figure 1: Schematic of the HMAX model. See Methods.

Within each filter band, a square of four adjacent, nonoverlapping C1
units is then grouped to provide input to an S2 unit. There are 256
different types of S2 units in each filter band, corresponding to the
4^4 possible arrangements of four C1 units of each of four types
(i.e., preferred bar orientation). The S2 unit response function is a
Gaussian with mean 1 (i.e., {1, 1, 1, 1}) and standard deviation 1,
i.e., an S2 unit has a maximal firing rate of 1 which is attained if
each of its four afferents fires at a rate of 1 as well. S2 units
provide the feature dictionary of HMAX, in this case all combinations
of 2 x 2 arrangements of "bars" (more precisely, C1 cells) at four
possible orientations.
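
As a rough illustration of the S2 stage, the following Python sketch enumerates the 256 feature types and evaluates a Gaussian response with mean 1 and standard deviation 1. The names `S2_TYPES` and `s2_response` are hypothetical, and the exact form of the exponent (squared distance divided by 2σ²) is an assumption; the text only specifies the mean, the standard deviation, and the maximal response of 1.

```python
import math
from itertools import product

ORIENTATIONS = (0, 45, 90, 135)  # preferred bar orientations in degrees

# One S2 type = one orientation assigned to each cell of a 2 x 2 square of C1 units.
S2_TYPES = list(product(ORIENTATIONS, repeat=4))


def s2_response(c1_afferents, mean=1.0, sigma=1.0):
    """Gaussian of the four C1 afferent activities, centered on (1, 1, 1, 1).

    The 2 * sigma**2 normalization is an assumption of this sketch.
    """
    sq_dist = sum((a - mean) ** 2 for a in c1_afferents)
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


print(len(S2_TYPES))                      # 256
print(s2_response([1.0, 1.0, 1.0, 1.0]))  # 1.0
```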

To  finally  achieve  size  invariance  over  all  filter  sizes 
in  the  four  filter  bands  and  position  invariance  over  the 
whole  visual  field,  the  S2  units  are  again  pooled  by  a 
MAX  operation  to  yield  C2  units,  the  output  units  of  the 
HMAX  core  system,  designed  to  correspond  to  neurons 
in  extrastriate  visual  area  V4  or  posterior  IT  (PIT).  There 
are  256  C2  units,  each  of  which  pools  over  all  S2  units  of 
one  type  at  all  positions  and  scales.  Consequently,  a  C2 
unit  will  fire  at  the  same  rate  as  the  most  active  S2  unit 
that  is  selective  for  the  same  combination  of  four  bars, 
but  regardless  of  its  scale  or  position. 

C2 units then again provide input to the view-tuned units (VTUs),
named after their property of responding well to a certain
two-dimensional view of a three-dimensional object, thereby closely
resembling the view-tuned cells found in monkey inferotemporal cortex
by Logothetis et al. [8]. The C2 → VTU connections are so far the only
stage of the HMAX model where learning occurs. A VTU is tuned to a
stimulus by selecting the activities of the 256 C2 units in response
to that stimulus as the center of a 256-dimensional Gaussian response
function, yielding a maximal response of 1 for a VTU in case the C2
activation pattern exactly matches the C2 activation pattern evoked by
the training stimulus. To achieve greater robustness in case of
cluttered stimulus displays, only those C2 units may be selected as
afferents for a VTU that respond most strongly to the training
stimulus [14]. We ran simulations with the 40, 100, and 256 strongest
afferents to each VTU. An additional parameter specifying response
properties of a VTU is its σ value, or the standard deviation of its
Gaussian response function. A smaller σ value yields more specific
tuning since the resultant Gaussian has a narrower half-maximum width.
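
The VTU tuning just described might be sketched as follows. This is a toy illustration rather than the original code: `train_vtu` and `vtu_response` are hypothetical names, the 6-unit C2 pattern stands in for the 256 C2 units, and the 2σ² normalization in the exponent is an assumption.

```python
import math


def train_vtu(c2_training, n_afferents):
    """Pick the n C2 units that respond most strongly to the training stimulus
    and store their activities as the center of the Gaussian."""
    order = sorted(range(len(c2_training)),
                   key=lambda i: c2_training[i], reverse=True)
    afferents = order[:n_afferents]
    center = [c2_training[i] for i in afferents]
    return afferents, center


def vtu_response(c2_pattern, afferents, center, sigma):
    """Gaussian response over the selected afferents only."""
    sq_dist = sum((c2_pattern[i] - c) ** 2 for i, c in zip(afferents, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Toy 6-unit C2 patterns (hypothetical values).
training = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2]
probe = [0.8, 0.5, 0.7, 0.3, 0.6, 0.2]

afferents, center = train_vtu(training, n_afferents=3)
print(afferents)                                           # [0, 4, 2]
print(vtu_response(training, afferents, center, sigma=0.2))  # 1.0
print(vtu_response(probe, afferents, center, sigma=0.2))
print(vtu_response(probe, afferents, center, sigma=0.1))
```

For the same deviating probe pattern, the smaller σ gives the lower response, which is the sharper tuning referred to in the text. Only the selected afferents enter the sum, so the change in unit 1 of the probe has no effect here.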

2.2  Stimuli 

We used the "8 car system" described in [17], created using an
automatic 3D multidimensional morphing system [19]. The system
consists of morphs based on 8 prototype cars. In particular, we
created lines in morph space connecting each of the eight prototypes
to all the other prototypes for a total of 28 lines through morph
space, with each line divided into 10 intervals. This created a set of
260 unique cars and induced a similarity metric: any two prototypes
are spaced 10 morph steps apart, and a morph at morph distance, e.g.,
3 from a prototype is more similar to this prototype than another
morph at morph distance 7 on the same morph line. Every car stimulus
was viewed from the same angle (left frontal view).
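
The stimulus counts follow directly from the construction of the morph space, as the short Python check below illustrates. The constant names are hypothetical; the only inputs are the 8 prototypes and the 10 intervals per morph line given in the text.

```python
from itertools import combinations

N_PROTOTYPES = 8
STEPS_PER_LINE = 10  # each morph line is divided into 10 intervals

# One morph line per unordered pair of prototypes.
morph_lines = list(combinations(range(N_PROTOTYPES), 2))

# 10 intervals leave 9 intermediate morphs between the two end prototypes.
interior_per_line = STEPS_PER_LINE - 1
n_morphs = len(morph_lines) * interior_per_line
n_cars = N_PROTOTYPES + n_morphs

print(len(morph_lines), n_morphs, n_cars)  # 28 252 260
```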

In addition, we used 75 out of a set of 200 paperclip stimuli (15
targets, 60 distractors) identical to those used by Logothetis et al.
in [8], and in [15]. Each of those was viewed from a single angle
only. Unlike in the case of cars, where features change smoothly when
morphed from one prototype to another, paperclips located nearby in
parameter space can appear very different perceptually, for instance,
when moving a vertex causes two previously separate clip segments to
cross. Thus, we did not examine the impact of parametric shape
variations on recognition performance for the case of paperclips.

Examples of car and paperclip stimuli are provided in Figure 2. The
background pixel value was always set to zero.

Figure 2: Examples of the car and paperclip stimuli used. Actual
stimulus presentations contained only one stimulus each.

2.3  Simulations 

2.3.1  Stimulus  transformations 

Translation. To study differences in HMAX output due to shifts in
stimulus position, we trained VTUs to the 8 car prototypes and 15
target paperclips positioned along the horizontal midline and in the
left half of the input image. We then calculated C2 and VTU responses
to those training stimuli and all remaining 252 car and 60 paperclip
stimuli for 60 other positions along the horizontal midline, spaced
single pixels apart. (Results for vertical stimulus displacements are
qualitatively identical, see Figure 3a.) To examine the effects of
shifting a stimulus beyond the receptive field limits of C2 cells, and
to find out how invariant response properties of VTUs depend on the
stimulus class presented, we displayed 5 cars and 5 paperclips in
isolation within a 100 x 100 pixel-sized image, at all positions along
the horizontal and vertical midlines, including positions where the
stimulus all but disappeared from the image. Responses of the 10 VTUs
trained to these car and paperclip stimuli (when centered within the
100 x 100 image) to all these stimuli, of both stimulus classes, were
then calculated.


Scaling. To examine size invariance, we trained VTUs to each of the 8
car prototypes and each of the 15 target paperclips at size 64 x 64
pixels, positioned at the center of the input image. We then
calculated C2 and VTU responses for all cars and paperclips at
different sizes, in half-octave steps (i.e., squares with edge lengths
of 16, 24, 32, 48, 96, and 128 pixels, and additionally 160 pixels),
again positioned at the center of the 160 x 160 input image.

2.3.2  Assessing  the  impact  of  different  filters  on 
model  unit  response 

To investigate the effects of different filter sizes on overall model
unit activity, we performed simulations using individual filter bands
(instead of the four in standard HMAX [15]). The filter band source of
a C2 unit's activity (i.e., which filter band was most active for a
given S2 / C2 feature and thus determined the C2 unit's response)
could be determined by running HMAX on a stimulus with only one filter
band active at a time and comparing these responses with the response
to the same stimulus when all filter bands were used.

2.3.3  Recognition  tasks 

To assess recognition performance, we used two different recognition
paradigms, corresponding to two different behavioral tasks.

"Most Active VTU" paradigm. In the first paradigm, a target is said to
be recognized if the VTU tuned to it fires more strongly to the test
image than all other VTUs tuned to other members of the same stimulus
set (i.e., the 7 other car VTUs in the case of cars, or the 14 other
paperclip VTUs in the case of paperclips). Recognition performance in
a given condition (e.g., for a certain stimulus size) is 100% if this
holds true for all prototypes. Chance performance here is always the
inverse of the number of VTUs (i.e., prototypes), since for any given
stimulus presentation, the probability that any given VTU is the most
active is 1 over the number of VTUs. This paradigm corresponds to a
psychophysical task in which subjects are trained to discriminate
between a fixed set of targets, and have to identify which of them
appears in a given presentation. We will refer to this way of
measuring recognition performance as the "Most Active VTU" paradigm.

"Target-Distractor Comparison" paradigm. Alternatively, a target
stimulus can be considered recognized in a certain presentation
condition if the VTU tuned to it responds more strongly to its
presentation than to the presentation of a distractor stimulus. If
this holds for all distractors presented, recognition performance for
that condition and that VTU is 100%. Chance performance is reached at
50% in this paradigm, i.e., when a VTU responds more strongly to a
distractor than to the target for half of the distractors. This
corresponds to a two-alternative forced-choice task in psychophysics
in which subjects are presented with a sample stimulus (chosen from a
fixed set of targets) and two choice stimuli (one of them being the
sample target, and the other being a distractor of varying similarity
to the target) and have to indicate which of the two choice stimuli is
identical to the sample. We will refer to this paradigm as the
"Target-Distractor Comparison" paradigm.

Figure 3: Effects of stimulus displacement on VTUs tuned to cars.
(a) Response of a VTU (40 afferents, σ = 0.2) to its preferred
stimulus at different positions in the image plane. (b) Mean
recognition performance of 8 car VTUs (σ = 0.2) for different
positions of their preferred stimuli and two different numbers of C2
afferents in the Most Active VTU paradigm. 100% performance is
achieved if, for each prototype, the VTU tuned to it is the most
active among the 8 VTUs. Chance performance would be 12.5%. (c) Same
as (b), but for the Target-Distractor Comparison paradigm. Chance
performance would be 50%. Note smaller variations for greater numbers
of afferents in both paradigms. Periodicity (30 pixels) is indicated
by arrows. Apparently smaller periodicity in (b) is due to the lower
number of distractors, producing a coarser recognition performance
measure. (d) Mean recognition performance in the Target-Distractor
Comparison paradigm for car VTUs with 40 afferents and σ = 0.2,
plotted against stimulus position and distractor similarity ("morph
distance"; see Methods).

For paperclips, distractors in the latter paradigm were 60 clips
randomly chosen from the whole set of 200 clips. We thus used exactly
the same method to assess recognition performance as in [14] for
double stimuli and cluttered scenes. For cars, we either chose all 259
nontarget cars as distractors for a given target (i.e., prototype)
car, as in Figures 3c and 10c, or, for a given target car, we used
only those car stimuli as distractors that were morphs on any of the 7
morph lines leading away from that particular target (including the 7
other car prototypes). By grouping those distractors according to
their morph distance from the target stimulus, we could assess the
differential effects of using more similar or more dissimilar
distractors in addition to the effects of variations in stimulus size
or position. This additional dimension is plotted in Figures 3d and
10d.
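
The two performance measures defined in this section can be written down compactly. The Python sketch below is one possible reading of those definitions, with hypothetical function names and made-up response values; it is not taken from the original simulation code.

```python
def most_active_vtu_performance(responses):
    """Fraction of targets whose own VTU is strictly the most active one.

    responses[i][j] is the response of VTU j to target stimulus i,
    with VTU i being the unit tuned to target i.
    """
    hits = 0
    for i, row in enumerate(responses):
        others = [r for j, r in enumerate(row) if j != i]
        if all(row[i] > r for r in others):
            hits += 1
    return hits / len(responses)


def target_distractor_performance(target_response, distractor_responses):
    """Fraction of distractors to which one VTU responds less strongly
    than to its own target."""
    wins = sum(1 for d in distractor_responses if target_response > d)
    return wins / len(distractor_responses)


# Made-up responses of 3 VTUs to their 3 targets.
responses = [
    [0.9, 0.4, 0.2],
    [0.3, 0.5, 0.6],  # VTU 2 out-fires the VTU tuned to target 1
    [0.1, 0.2, 0.8],
]
print(most_active_vtu_performance(responses))                    # 2 of 3 targets
print(target_distractor_performance(0.5, [0.1, 0.6, 0.3, 0.4]))  # 0.75
```

In the first measure, chance level is 1 over the number of VTUs; in the second it is 0.5, reached when half of the distractors evoke a stronger response than the target.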

3  Results 

3.1  Changes  in  stimulus  position 

We find that the response of a VTU to its preferred stimulus depends
on the stimulus' position within the image (Figure 3a). Moving a
stimulus away from the position it occupied during training of the VTU
results in decreased VTU output. This can lead to a drop in
recognition performance, i.e., another VTU tuned to a different
stimulus might fire more strongly to this stimulus than the VTU
actually tuned to it (Figure 3b), or the VTU might respond more
strongly to a nontarget stimulus at that position (Figure 3c). This is
more likely for a similar than for a dissimilar distractor (Figure
3d). The changes in VTU output and recognition performance depend on
position in a periodic fashion, and they are qualitatively identical
across stimulus classes (see Figure 4 for paperclips). Larger
variations in recognition performance for cars are due to the fact
that, on average, any two car stimuli are more similar to each other
(in terms of C2 activation patterns) than any two paperclips.

Figure 4: Effects of stimulus displacement on VTUs tuned to
paperclips. (a) Responses of VTUs with 40 or 256 afferents and σ = 0.2
to their preferred stimuli at different positions. Note larger
variations in VTU activity for more afferents, as opposed to
variations in recognition performance, which are smaller for more
afferents. (b) Mean recognition performance of 15 VTUs tuned to
paperclips (σ = 0.2) in the Most Active VTU paradigm, depending on
position of their preferred stimuli, for two numbers of C2 afferents.
Chance performance would be 6.7%. (c) Same as (b), but for the
Target-Distractor Comparison paradigm. Chance performance would be
50%.

The "wavelength" λ of these "oscillations" in output and recognition
performance is found to be a function of both the spatial pooling
range of the C1 units (poolRange) and their spatial overlap
(c1Overlap) in the different filter bands, in a manner described as
follows:

$$\lambda = \operatorname{lcm}_i \left( \operatorname{ceil}\left( \frac{\mathit{poolRange}_i}{\mathit{c1Overlap}} \right) \right) \qquad (1)$$

with i running from 1 to the number of filter bands, and lcm being the
lowest common multiple. Thus, standard HMAX parameters (pooling range
4, 6, 9, or 12 S1 units for C1 units in the four filter bands,
respectively; C1 overlap 2) yield a λ value which is the least common
multiple of 2, 3, 5, and 6, i.e., 30. This means that changing the
position of a stimulus by multiples of 30 pixels in x- or y-direction
does not alter C2 or VTU responses or recognition performance.
(Smaller λ values for paperclips apparent in Figure 4 derive from the
dominance of small filter bands activated by the paperclip stimuli, as
discussed later and in Figure 11.)
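
Equation 1 is easy to evaluate numerically. The following Python sketch, with the hypothetical function name `modulation_period`, computes λ from a list of pooling ranges and a C1 overlap value and reproduces the value of 30 pixels stated above for the standard parameters.

```python
import math
from functools import reduce


def lcm(a, b):
    """Lowest common multiple of two positive integers."""
    return a * b // math.gcd(a, b)


def modulation_period(pool_ranges, c1_overlap):
    """Evaluate Eq. 1: lcm over filter bands of ceil(poolRange_i / c1Overlap)."""
    offsets = [-(-p // c1_overlap) for p in pool_ranges]  # integer ceiling
    return reduce(lcm, offsets)


print(modulation_period([4, 6, 9, 12], 2))  # 30 (offsets 2, 3, 5, 6)
print(modulation_period([4, 6, 9, 12], 3))  # 12 (offsets 2, 2, 3, 4)
print(modulation_period([16], 4), modulation_period([8], 2))  # 4 4
```

The second call uses an overlap of 3 with the standard pooling ranges, and the last line shows that a single band with pooling range 16 and overlap 4 has the same period as one with pooling range 8 and overlap 2.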

These modulations can be explained by a "loss of features" occurring
due to the way C1 and S2 units sample the input image and pool over
their afferents. This is depicted in Figure 5 for only one filter size
(i.e., a single spatial pooling range). A feature of the stimulus in
its original position (symbolized by two solid bars) is detected by
adjacent C1 units that feed into the same S2 unit. Moving the stimulus
to the right can position the right part of the feature beyond the
limits of the right C1 unit's receptive field, while the left feature
part is still detected by the left C1 unit. Consequently, this feature
is "lost" for the S2 unit these C1 units feed into. However, the
feature is not detected by the next S2 unit to the right, either
(whose afferent C1 units are depicted by dotted outlines in Figure 5),
since the left feature part does not yet fall within the limits of
this S2 unit's left C1 afferent. Not until the stimulus has been
shifted far enough so that all features can again be detected by the
next set of S2 units to the right will the output of HMAX again be
identical to the original value. As can be seen in Figure 5, this is
precisely the case when the stimulus is shifted by a distance equal to
the offset of the C1 units with respect to each other, which in turn
equals the quotient poolRange/c1Overlap. If multiple filter bands are
used, each of which contributes to the overall response, the more
general formula given in Eq. 1 applies.

Figure 5: Schematic illustration of feature loss due to a change in
stimulus position. In this example, the C1 units have a spatial
pooling range of 4 (i.e., they pool over an array of 4 x 4 S1 units)
and an overlap of 2 (i.e., they overlap by half their pooling range).
Only 2 of 4 C1 afferents to an S2 unit are shown. See text.

Note that stimulus position during VTU training is in no way special,
and shifting the stimulus to a different position might just as well
cause "emergence" of features not detected at the original position.
However, due to the VTUs' Gaussian tuning, any deviation of the C2
activation pattern from the training pattern will cause a decrease in
VTU response, regardless of whether individual C2 units display a
stronger or weaker response.

To improve performance during recognition in clutter, only a subset of
the 256 C2 units (those which respond best to the original stimulus)
may be used as inputs to a VTU, as described in [14]. This results in
a less pronounced variation of a VTU's response when its preferred
stimulus is presented at a nonoptimal position, simply because fewer
terms appear in the exponent of the VTU's Gaussian response function
so that fewer deviations from the optimal C2 unit activity are summed
up (see Figure 4a). On the other hand, choosing a smaller value for
the σ parameter of a VTU's Gaussian response function, corresponding
to a sharpening of its tuning, leads to larger variations in its
response for shifted stimuli, since the response will drop more
sharply already for a small change of the C2 activation pattern. In
any case, it should be noted that even sharp drops in VTU activity
need not entail a corresponding decrease in recognition performance.
In fact, while using a greater number of afferents to a VTU increases
the magnitude of activity fluctuations due to changing stimulus
position, the variations in recognition performance are actually
smaller than for fewer afferents (see, for example, Figures 3b, c, and
4b, c). This is likely due to the increasing separation of stimuli as
the dimensionality of the feature space increases.

Figure 6: Dependence of maximum stimulus shift-related differences in
VTU activity on pooling range and C1 overlap. Results shown are for a
typical VTU with 256 afferents and σ = 0.2 tuned to a car stimulus
using only the largest S1 filter band (23 x 23 to 29 x 29 pixels)
employed. Arrows indicate standard HMAX parameter settings.

Figure 6 shows the maximum differences in VTU activity encountered due
to variations in stimulus position for different pooling ranges and C1
overlaps, using only the largest S1 filter band (from 23 x 23 pixels
to 29 x 29 pixels). Activity modulations are generally greater for
larger pooling ranges and smaller C1 overlaps, since the input image
is sampled more coarsely for such parameter settings, and thus the
chance for a given feature to become "lost" is greater. Note that for
a pooling range of 16 S1 units and a C1 overlap of 4, VTU response is
subject to larger variations at different stimulus positions than for
a pooling range of 8 and C1 overlap of 2, even though both cases, in
accordance with equation 1, share the same λ value. This can be
explained by the fact that large S1 filters only detect large-scale
variations in the image. If nevertheless small C1 pooling ranges are
used, the resulting small receptive fields of S2 units will experience
only minor differences in their input when the stimulus is shifted by
a few pixels (see Figure 7). Conversely, when small S1 filters are
used, the filtered image varies most on a small scale, and smaller
pooling ranges will yield larger variations in HMAX output than larger
pooling ranges (not shown).

Figure 7: Illustration of different effects of stimulus displacement
for different C1 pooling ranges and large S1 filters. As in Figure 5,
boundaries of C1 / S2 receptive fields are drawn around centers of S1
receptive fields. Actual S1 receptive fields, however, are of course
much larger, especially for large filter sizes. (a) Large S1 filters
yield large features. For small C1 pooling ranges, only few
neighboring S1 cells feed into a C1 cell, and they largely detect the
same feature due to coarse filtering and since they are only spaced
single pixels apart. Hence, moving a stimulus does not change input to
a C1 cell much. (b) C1 / S2 units with larger pooling ranges are more
likely to detect complex features (indicated by two shaded bars
instead of only one in (a)) even if the S1 filters used are large.
Thus, in this situation, C1 / S2 cells are more susceptible to
variations in stimulus position.

The position-dependent modulations of C2 unit activity (Figure 8), one
level below the VTUs, are in agreement with the observations made
earlier. The additional drop in activity for the leftmost stimulus
positions is due to an edge effect at the corner of the image. Since
care has been taken in HMAX to ensure that all S1 filters, even those
centered on pixels at the image boundary, receive an equal amount of
input (to achieve this, the input image is internally padded with a
"frame" of zero-value pixels), even a feature situated at an image
boundary cannot slip through the array of C1 units. This is different,
however, for S2 units; an S2 unit that detects such a feature via its
top right and/or bottom right C1 afferent might not be able to detect
this feature any more if it is positioned at the far left of the input
image.

3.2 Changes in stimulus size

Size-invariance in object recognition is achieved in HMAX by pooling over units that respond best
to the same feature at different sizes. As has already been shown in [15], this works well over a
wide range of stimulus sizes for paperclip stimuli: VTUs tuned to paperclips of size 64 × 64
pixels display very robust recognition performance for enlarged presentations of the clips and
also good performance for reduced-size presentations (see Figure 9).

Figure 8: Mean activity of all 256 C2 units plotted against position of a car stimulus. Note that
in the simulation run to generate this graph, spatial pooling ranges of the four filter bands used
were still 4, 6, 9, and 12 S1 units, respectively, as in previous figures, while the C1 overlap
value was changed to 3. Thus, in agreement with equation 1, the periodicity of activity changes
was reduced to 12 pixels.

Interestingly, with cars a different picture emerges. As can be seen in Figure 10b and c,
recognition performance in both paradigms drops considerably both for enlarged and downsized
presentations of a car stimulus. Figure 10d shows that this low performance does not even increase
if rather dissimilar distractors are used (corresponding to higher values on the morph axis).
Shrinking a stimulus understandably decreases recognition performance since its characteristic
features quickly disappear due to limited resolution. Figure 11 suggests a reason why performance
for car stimuli drops for increases in size as well. While paperclip stimuli (size 64 × 64 pixels)
elicit a response mostly in filter bands 1 and 2 (containing the smaller filters from 7 × 7 to
15 × 15 pixels in size), car stimuli of the same size mostly activate the large filters (23 × 23
to 29 × 29 pixels) in filter band 4. This is most probably due to the fact that our "clay
model-like" rendered cars contain very little internal structure so that their most conspicuous
features are their outlines. Consequently, enlarging a car stimulus will blow up its
characteristic features (to which, after all, the VTUs are trained) beyond the scale that can
effectively be detected by the standard S1 filters of the model, reducing recognition performance
considerably.
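
The size argument above can be made concrete with a small sketch. It is an idealization rather
than the model's actual computation: it assumes that a feature's extent grows linearly with
stimulus size and that a feature is detectable only while some S1 filter is large (and small)
enough to span it. The filter sizes are those listed in the Figure 11 caption, with a 2-pixel
step between sizes taken as an assumption; the names FILTER_SIZES, covered, and max_enlargement
are hypothetical.

```python
# S1 filter sizes in pixels, 7x7 up to 29x29 (Figure 11 caption).
# The 2-pixel step between neighbouring sizes is an assumption.
FILTER_SIZES = list(range(7, 31, 2))


def covered(feature_size, scale):
    """True if a feature matched by a filter of `feature_size` pixels at the
    training size still falls inside the filter range after the stimulus is
    rescaled by `scale` (idealized: feature size scales linearly)."""
    scaled = feature_size * scale
    return min(FILTER_SIZES) <= scaled <= max(FILTER_SIZES)


def max_enlargement(feature_size):
    """Largest scale factor for which the feature stays inside the range."""
    return max(FILTER_SIZES) / feature_size


# Paperclip-like feature (band 2) versus car-like feature (band 4).
print(covered(11, 2.0), covered(25, 1.5))
print(round(max_enlargement(11), 2), round(max_enlargement(25), 2))
```

In this simplified picture, a feature matched by a 25-pixel filter leaves the covered range after
an enlargement of less than 1.2, whereas one matched by an 11-pixel filter can more than double in
size before doing so.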

Mathematically, both scaling and translation are examples of 2D affine transformations whose
effects on an object can be estimated exactly from just one object view. The two transformations
are also treated in the same fashion in HMAX, by MAX-pooling over afferents tuned to the same
feature, but at different positions or scales, respectively. However, it appears that, while the
behavior of the model for stimulus translation is similar for the two object classes we used, the
scale invariance ranges differ substantially. This is, however, most likely not due to a
fundamental difference in the representation of these two stimulus transformations in a
hierarchical neural system. It has to be taken into consideration that, in HMAX, there are more
units at different positions for a given receptive field size than there are units with different
receptive field sizes for a given position. Moreover, stimulus position was changed in a linear
fashion in our experiments, pixel by pixel, and only within the receptive fields of the C2 units,
while stimulus size was changed exponentially, making it more likely that critical features appear
at a scale beyond detectability. Conversely, with a broader range of S1 filter sizes and smaller,
linear steps of stimulus size variation, periodic changes of VTU activity and recognition
performance similar to those seen with stimulus translation might be observed. Alternatively,
recognition performance for translation of stimuli could depend on stimulus class in a manner
analogous to that found here for scaling if, for example, for a certain stimulus class the
critical features were located only at a single position or a small number of nearby positions
within the stimulus, which would cause the response to change drastically when the corresponding
part of the object was moved out of the receptive field (see also next section).

Figure 9: Effects of stimulus size on VTUs tuned to paperclips of size 64 × 64 pixels.
(a) Responses of VTUs with 40 or 256 afferents and σ = 0.2 to their preferred stimuli at different
sizes. (b) Mean recognition performance of 15 VTUs tuned to paperclips (σ = 0.2) in the Most
Active VTU paradigm, depending on the size of their preferred stimuli, for two different numbers
of C2 afferents. Horizontal line indicates chance performance (6.7%). (c) Same as (b), but for the
Target-Distractor Comparison paradigm. Horizontal line indicates chance performance (50%).

However, control experiments with larger S1 filter sizes (up to 59 × 59 pixels) failed to improve
recognition performance for scaled car stimuli over what was observed in Figure 10, because in
this case, again the largest filters were most active already in response to a car stimulus of
size 64 × 64 pixels (not shown). This makes clear that, especially if neural processing resources
are limited, recognition performance for transformed stimuli depends on how well the feature set
is matched to the object class. Indeed, paperclips are composed of features that the standard HMAX
S2 / C2 units apparently capture quite well, namely combinations of bars of various orientations,
and the receptive field sizes of the S2 filters they activate most are in good correspondence with
their own size. Consequently, the different filter bands present in the model can actually be
employed appropriately to detect the critical features of paperclip stimuli at different sizes,
leading to a high degree of scale invariance for this object class in HMAX.

Figure 10: Effects of stimulus size on VTUs tuned to cars of size 64 × 64 pixels. (a) Responses of
VTUs with 40 or 256 afferents and σ = 0.2 to their preferred stimuli at different sizes. (b) Mean
recognition performance of 8 car VTUs (σ = 0.2) for different sizes of their preferred stimuli and
two different numbers of C2 afferents in the Most Active VTU paradigm. Chance performance (12.5%)
indicated by horizontal line. (c) Same as (b), but for the Target-Distractor Comparison paradigm.
Chance performance (50%) indicated by horizontal line. (d) Mean recognition performance in the
Target-Distractor Comparison paradigm for car VTUs with 40 afferents and σ = 0.2, plotted against
stimulus size and distractor similarity ("morph distance").

3.3 Influence of stimulus class on invariance properties

Results in the previous section demonstrated that model VTU responses and recognition performance
can depend on the particular stimulus class used, and on how well the features preferentially
detected by the model match the characteristic features of stimuli from that class. This was shown
to be an effect of the S2 level features. Therefore, we would expect to see different invariance
properties not only for VTUs tuned to different objects, but also for a single VTU when probed
with different stimuli, depending on how well the shape of the different probe stimuli is matched
by the S2 features.

Figure 12 shows that this is indeed possible. Panel (a) displays responses of a car VTU to its
preferred stimulus and a paperclip stimulus, respectively, at varying stimulus positions including
positions beyond the receptive fields of the VTU's C2 afferents. While different strengths of the
responses to the two stimuli, different periodicities of response variations (corresponding to the
filter bands activated most by the two stimuli, as discussed in section 3.1), and edge effects are
observed, invariance ranges, i.e., the range of positions that can influence the VTU's responses,
are approximately equal. This indicates that the stimulus regions that maximally activate the
different S2 features do not cluster at certain positions within the stimulus, but are fairly
evenly distributed, as mentioned above in section 3.2. Panel (b), however, shows differing
invariance properties of another VTU tuned to a car stimulus when presented with its preferred
stimulus or a paperclip, for stimuli varying in size. While this VTU's maximum response level is
reached with its preferred stimulus, its response is more invariant when probed with a paperclip
stimulus, due to the choice of filter sizes in HMAX and the filter bands activated most by the
different stimuli. (Smaller invariance ranges of VTU responses for nonpreferred stimuli are also
observed, but usually invariance ranges for preferred and nonpreferred stimuli are not the same.)
This suggests that data from a physiology experiment about response invariances of a neuron, which
are usually not collected with the stimulus the neuron is actually tuned to [7], might give
misleading information about its actual response invariances, since these can depend on the
stimulus used to map receptive field properties.

Figure 11: HMAX activities in different filter bands. Plot shows filter band source of C2 activity
for 8 car prototypes and 8 randomly selected paperclips, all of size 64 × 64 pixels. Filter band
1: filters 7 × 7 and 9 × 9 pixels; filter band 2: from 11 × 11 to 15 × 15 pixels; filter band 3:
from 17 × 17 to 21 × 21 pixels; filter band 4: from 23 × 23 to 29 × 29 pixels. Each bar indicates
the mean percentage of C2 units that derived their activity from the corresponding filter band
when a car or paperclip stimulus was shown. The filter band that determines the response of a C2
unit contains the most active units among those selective to that C2 unit's preferred feature. It
indicates the size at which this feature occurred in the image.

4 Discussion

The ability to recognize objects with a high degree of accuracy despite variations of their
particular position and scale on the retina is one of the major accomplishments of the visual
system. Key to this achievement may be the hierarchical structure of the visual system, in which
neurons with more complex response properties (i.e., responding to more complex features or
showing a higher degree of invariance) result from the combination of outputs of neurons with
simpler response properties [15]. It is an open question how the parameters of the hierarchy
influence the invariance and shape-tuning properties of neurons in IT. In this paper, we have
studied the performance of a hierarchical model of object recognition in cortex, the HMAX model,
on tasks involving changes in stimulus position and size using abstract (paperclips) and more
"natural" stimuli (cars). Invariant recognition is achieved in HMAX by pooling over model units
that are sensitive to the same feature, but at different sizes and positions, an approach which is
generally considered key to the construction of complex receptive fields in visual cortex
[2, 3, 10, 13, 20]. Pooling in HMAX is done by the MAX operation, which preserves feature
specificity while increasing invariance range [15].
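
Why MAX pooling, rather than summation, preserves feature specificity can be seen in a toy
example. The following is a minimal one-dimensional sketch, not the HMAX implementation: the
binary "images", the three-pixel bar template, and the function name s_responses are all
hypothetical, and the unit response is simply the fraction of the template that is matched.

```python
def s_responses(image, template):
    """Responses of units tuned to the same feature at every position:
    the fraction of the template matched at each offset."""
    n = len(template)
    return [
        sum(a * b for a, b in zip(image[i:i + n], template)) / sum(template)
        for i in range(len(image) - n + 1)
    ]


template = [1, 1, 1]                      # preferred feature: a 3-pixel bar
bar_left = [1, 1, 1, 0, 0, 0, 0, 0]       # feature at the left
bar_right = [0, 0, 0, 0, 0, 1, 1, 1]      # same feature, shifted
clutter = [1, 0, 0, 1, 0, 0, 1, 0]        # only partial matches anywhere

for name, image in [("left", bar_left), ("right", bar_right), ("clutter", clutter)]:
    r = s_responses(image, template)
    print(name, "MAX:", round(max(r), 3), "SUM:", round(sum(r), 3))
```

In this example the MAX over positions is identical for the two bar positions and clearly lower
for the clutter image, whereas the sum over positions is the same for the bar and for the clutter,
so a summing unit would gain invariance at the cost of confusing many weak partial matches with
one full match.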

A simple yet instructive solution to the problem of invariant recognition consists of a detector
for each object, at each scale and each position. While appealing in its simplicity, such a model
suffers from a combinatorial explosion of the number of cells (for each additional object to be
recognized, another set of cells would be required) and from its lack of generalizing power: If an
object had only been learned at one scale and position in this system, recognition would not
transfer to other scales and positions.

The observed invariance ranges of IT cells after training with one view are reflected in the
architecture used in HMAX (see [14]): One of its underlying ideas is that invariance and feature
specificity have to grow in a hierarchy so that view-tuned cells at higher levels show sizeable
invariance ranges even after training with only one view, as a result of the invariance properties
of the afferent units. The key concept is to start with simple localized features: since the
discriminatory power of simple features is low, the invariance range has to be kept
correspondingly low to avoid the cells being activated indiscriminately. As feature complexity and
thus discriminatory power grow, the invariance range, i.e., the size of the receptive field, can
be increased as well. Thus, loosely speaking, feature specificity and invariance range are
inversely related, which is one of the reasons the model avoids a combinatorial explosion in the
number of cells: while there are more different features in higher layers, there do not have to be
as many units responding to these features as in lower layers since higher-layer units have bigger
receptive fields and respond to a greater range of scales.

Figure 12: Responses of VTUs to stimuli of different classes at varying positions and sizes.
(a) Responses of a VTU tuned to a centered car prototype to its preferred stimulus and a paperclip
stimulus at varying positions on the horizontal midline within the receptive field of its C2
afferents, including positions partly or completely beyond the receptive field (smallest and
largest position values). Size of stimuli 64 × 64 pixels, image size 100 × 100 pixels, 40 C2
afferents to the VTU, σ = 0.2. (b) Responses of a VTU tuned to a car prototype (size 64 × 64
pixels) to its preferred stimulus and a paperclip stimulus at varying sizes. Stimuli centered
within a 160 × 160 pixel image, 40 C2 afferents to the VTU, σ = 0.2.

This hierarchical buildup of invariance and feature specificity greatly reduces the overall number
of cells required to represent additional objects in the model: The first layer contains a little
more than one million cells (160 × 160 pixels, at four orientations and 12 scales each). The
crucial observation is that if additional objects are to be recognized irrespective of scale and
position, the addition of only one unit, in the top layer, with connections to the (256) C2 units,
is required.
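
The cell count quoted above can be checked with a few lines of arithmetic. This is a sketch under
the assumption of one first-layer unit per pixel, orientation, and filter size; the 12 scales are
taken to be the filter sizes 7, 9, ..., 29 pixels, which is consistent with the bands in the
Figure 11 caption if sizes advance in 2-pixel steps (an assumption). The brute-force figure is an
illustrative count on the same grid, not a number from the paper, and all variable names are
hypothetical.

```python
IMAGE_SIDE = 160                       # pixels per image side
ORIENTATIONS = 4
FILTER_SIZES = list(range(7, 31, 2))   # 7, 9, ..., 29: assumed 2-pixel steps

# First layer: one unit per pixel, orientation and filter size (assumption).
first_layer_cells = IMAGE_SIDE ** 2 * ORIENTATIONS * len(FILTER_SIZES)

# Cost of one additional object in the hierarchy: a single top-layer unit
# with connections to the 256 C2 units.
new_units_per_object = 1
new_connections_per_object = 256

# Illustrative comparison: a literal "one detector per position and scale"
# scheme counted on the same grid would need this many cells per object.
brute_force_cells_per_object = IMAGE_SIDE ** 2 * len(FILTER_SIZES)

print(len(FILTER_SIZES), first_layer_cells, brute_force_cells_per_object)
```

The first-layer total comes to 1,228,800, in line with "a little more than one million", and it
is paid once, whereas the brute-force scheme would pay its per-object cost again for every new
object.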

However, we have shown that discretization of the input image and hierarchical buildup of features
yield a model response that is not completely independent of stimulus position: A "feature loss"
may occur if a stimulus is shifted away from its original position. This is not too surprising
since there are more possible stimulus positions than model units. We presented a quantitative
analysis of the changes occurring in HMAX output and recognition performance in terms of stimulus
position and parameter settings. Most significantly, there is no feature loss if a stimulus is
moved to an equivalent position with respect to the discrete organization of the C1 and S2 model
units. Equivalent positions are separated by a distance that is the least common multiple of the
(poolRange/c1Overlap) ratio for the different filter bands used.
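
The least-common-multiple rule can be sketched as follows. Equation 1 itself is not reproduced in
this section, so the rounding of each poolRange/c1Overlap ratio up to the next integer is an
assumption; it is chosen because it reproduces the 12-pixel period reported in the Figure 8
caption for pooling ranges 4, 6, 9, and 12 with an overlap value of 3. The function and parameter
names are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def modulation_period(pool_ranges, c1_overlap):
    """Distance in pixels between equivalent stimulus positions.

    Each band contributes ceil(pool_range / c1_overlap); the rounding up
    is an assumption, since Equation 1 is not shown in this section.
    """
    steps = [-(-p // c1_overlap) for p in pool_ranges]  # integer ceiling
    return reduce(lcm, steps)


print(modulation_period([4, 6, 9, 12], 3))  # 12, as in the Figure 8 caption
print(modulation_period([4, 6, 9, 12], 2))  # 30 under the same assumption
```

Under this reading, lowering the overlap value lengthens the period, which agrees with the
caption's statement that changing the overlap to 3 reduced the periodicity.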

It should be noted that the features affected by feature loss are the composite features generated
at the level of S2 units, not the simple C1 features; if only four output (S2/C2) unit types with
the same feature sensitivity as C1 units are used, no modulations of activity with stimulus
position are observed (as in the "10 feature" version of HMAX in [15]). As opposed to a composite
feature, a simple feature will of course never go undetected by the C1 units as long as there is
some overlap between them (Figure 5).

The basic mechanisms responsible for variations in HMAX output due to changes in stimulus position
are independent of the particular stimuli used. We showed that feature loss occurs for cars as
well as for paperclips, and that it follows the same principles in both cases. However, we found
that while HMAX performs very well at size-invariant recognition of paperclips, its performance is
much worse for cars. This discrepancy relates to the more limited availability of model units with
different receptive field sizes, as compared to units at different positions, in the current HMAX
model, as well as to the particular feature dictionary used in HMAX. The model's high performance
for recognition of paperclips derives from the fact that its feature detectors are well-matched to
this object class: they closely resemble actual paperclip features, and the size of the detectors
activated most by a stimulus is in good agreement with stimulus size. The latter is especially
important for the model to take advantage of its different filter sizes for detection of features
regardless of their size. This correspondence between stimulus size and size of the most active
feature detectors does not hold for cars; hence the model's low performance at size-invariant
recognition for this object class.

These findings show that invariant recognition performance can differ for different stimuli
depending on how well the object recognition system's features match the stimuli. Our simulation
results suggest that invariance ranges of a particular neuron might depend on the shape of the
stimuli used to probe it. This is especially relevant as most experimental studies only test scale
or position invariance using a single object, which in general is not identical to the object the
neuron is actually tuned to [7] (the "preferred" object). Thus, invariance ranges calculated based
on the responses to just one object are possibly different from the actual values that would be
obtained with the preferred object (assuming that the neuron receives input from neurons in lower
areas tuned to the features relevant to the preferred object [14]).

It is important to note that for any changes in HMAX response due to changes in stimulus position
or size, drops in VTU output are not necessarily accompanied by drops in recognition performance.
Recognition performance depends on relative VTU activity and thus on number and characteristic
features of distractors, and it can remain high even for drastically reduced absolute VTU activity
(see [14]). Thus, size- and position-invariant recognition of objects does not require a model
response that is independent of stimulus size and position. Furthermore, as Figure 6 shows, the
magnitude of fluctuations can be controlled by varying the parameters that control pooling range
and receptive field overlap in the hierarchy. It will be interesting to examine whether one can
derive constraints for these variables from the physiology literature. For instance, recent
results on the receptive field profiles of IT neurons [12] suggest that the majority of IT neurons
have Gaussian, i.e., unimodal, profiles. In the framework of our model this corresponds to a
periodicity of VTU response with a wavelength which is greater than the size of the receptive
field. This would argue (Eq. 1) for either low values of C1 overlap or high values of the pooling
range. The observation that the average linear extent of a complex cell receptive field is 1.5-2
times that of simple cells [4] (greater than the pooling ranges in the standard version of the
model) is compatible with this requirement. Clearly, more detailed data on the shape tuning of
neurons in intermediate visual areas, such as V4, are needed to quantitatively test this
hypothesis.
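
The earlier point that recognition depends on relative rather than absolute VTU activity can be
illustrated with a toy calculation. The sketch assumes that a VTU responds as a Gaussian function
of the distance between the current C2 activity pattern and its preferred pattern, with σ = 0.2 as
in the figure captions; the Gaussian form, the four-element C2 vectors, and the function names are
assumptions made for illustration and are not taken from simulation data.

```python
from math import exp

SIGMA = 0.2  # tuning width quoted in the figure captions


def vtu_response(c2, preferred, sigma=SIGMA):
    """Gaussian tuning around the preferred C2 pattern (assumed form)."""
    d2 = sum((x - w) ** 2 for x, w in zip(c2, preferred))
    return exp(-d2 / (2 * sigma ** 2))


def most_active_vtu(c2, prototypes):
    """Index of the VTU with the strongest response to the C2 pattern."""
    responses = [vtu_response(c2, w) for w in prototypes]
    return max(range(len(responses)), key=responses.__getitem__)


# Hypothetical preferred patterns of two VTUs.
target = [0.9, 0.1, 0.8, 0.2]
distractor = [0.2, 0.8, 0.1, 0.9]

# C2 pattern evoked by the target after a shift or rescaling: weaker, but
# still far closer to the target pattern than to the distractor pattern.
transformed = [0.7, 0.15, 0.6, 0.25]

print(round(vtu_response(transformed, target), 3))
print(most_active_vtu(transformed, [target, distractor]))
```

Here the target VTU's response falls well below half of its maximum, yet it remains the most
active unit, so the stimulus would still be classified correctly. With n VTUs in such a
most-active comparison, chance performance is 1/n, which matches the 6.7% (15 VTUs) and 12.5%
(8 VTUs) levels given in the captions of Figures 9 and 10.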